@lgeiger commented Sep 23, 2025

Image.tobytes() is used in __array_interface__ when images are passed to NumPy via np.asarray(...). Converting PIL images to NumPy arrays is very common; ML libraries such as vLLM and some PyTorch dataloaders rely on this path.
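
For illustration, the conversion path described above looks like this (a minimal example; the 2048x2048 size is arbitrary):

import numpy as np
from PIL import Image

img = Image.new("RGB", (2048, 2048))

# np.asarray() reads img.__array_interface__, whose "data" field is filled
# by img.tobytes(), so the whole image gets serialized at this point.
arr = np.asarray(img)
print(arr.shape, arr.dtype)  # (2048, 2048, 3) uint8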

For large images this can be quite slow and can become a bottleneck, since .tobytes() encodes the data in fixed-size chunks that need to be joined afterwards:

Pillow/src/PIL/Image.py

Lines 798 to 808 in d42e537

output = []
while True:
    bytes_consumed, errcode, data = e.encode(bufsize)
    output.append(data)
    if errcode:
        break
if errcode < 0:
    msg = f"encoder error {errcode} in tobytes"
    raise RuntimeError(msg)
return b"".join(output)

This PR increases the buffer size to match the image size when the default raw encoder is used, instead of using a fixed value. In most cases this allows the image to be encoded in a single chunk, which speeds up encoding of large images by over 2x:

| image size | main | this PR | main / this PR |
| --- | --- | --- | --- |
| 128x128 | 4.54 μs | 4.65 μs | 0.98 |
| 256x256 | 18.2 μs | 13.2 μs | 1.38 |
| 512x512 | 60.2 μs | 46.1 μs | 1.31 |
| 1024x1024 | 382 μs | 245 μs | 1.56 |
| 2048x2048 | 1.97 ms | 1.16 ms | 1.70 |
| 4096x4096 | 10.6 ms | 5.49 ms | 1.93 |
| 8192x8192 | 54.3 ms | 22.8 ms | 2.38 |
| 16384x16384 | 230 ms | 92.3 ms | 2.49 |

Benchmarked with the following IPython script:

import numpy as np
from PIL import Image

for size in (128, 256, 512, 1024, 2048, 4096, 8192, 16384):
    img = np.random.randint(0, 256, size=(size, size, 3), dtype=np.uint8)
    img = Image.fromarray(img)

    print(f"{size}x{size}")
    %timeit img.tobytes()

@radarhere (Member)

Just to mention for anyone else reading this - individual users can already adjust this behaviour for themselves by setting MAXBLOCK.

from PIL import ImageFile
ImageFile.MAXBLOCK = 65536 * 4
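
For context, a user-side workaround along those lines could be combined with the conversion path above (a sketch; the 16 MiB value is only an illustrative choice, not a recommendation):

import numpy as np
from PIL import Image, ImageFile

# Raise the fixed chunk size used by tobytes() so large images are
# serialized in far fewer chunks.
ImageFile.MAXBLOCK = 16 * 1024 * 1024

img = Image.new("RGB", (4096, 4096))
arr = np.asarray(img)  # goes through img.tobytes() as described above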

@wiredfool (Member)

Have you tried using the Arrow interface, which is zero-copy?

@lgeiger (Author) commented Sep 23, 2025

Have you tried using the Arrow interface, which is zero-copy?

My use case still requires a numpy array as output (or at least access to the raw bytes). How would I do this from a user's perspective with the Arrow interface? Currently I'm just using np.asarray(img), which calls .tobytes() under the hood.

@radarhere (Member)

I would guess

import numpy as np
import pyarrow as pa
from PIL import Image
img = Image.new("RGB", (12, 12))
np.array(pa.array(img))

but it doesn't seem faster to me.

@lgeiger (Author) commented Sep 30, 2025

np.array always makes a copy, so we wouldn't really gain much. pa.array(img).to_numpy(zero_copy_only=True) doesn't seem to work in my case, so zero_copy_only=False or np.array() would be needed, which is very slow.

Here's a quick benchmark with this PR:

import numpy as np
import pyarrow as pa
from PIL import Image

rng = np.random.default_rng(42)

for size in (128, 256, 512, 1024, 2048, 4096, 8192, 16384):
    img = rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8)
    img = Image.fromarray(img)

    print(f"{size}x{size}")
    %timeit img.tobytes()
    %timeit np.asarray(img)
    %timeit pa.array(img)
    %timeit pa.array(img).to_numpy(zero_copy_only=False)

| image size | img.tobytes() | np.asarray(img) | pa.array(img) | pa.array(img).to_numpy(zero_copy_only=False) |
| --- | --- | --- | --- | --- |
| 128x128 | 4.61 μs ± 39 ns | 5.53 μs ± 31.2 ns | 1.28 μs ± 4.5 ns | 1.4 ms ± 3.29 μs |
| 256x256 | 13.1 μs ± 86.8 ns | 13.9 μs ± 19.8 ns | 1.3 μs ± 22.2 ns | 5.86 ms ± 14.2 μs |
| 512x512 | 45.8 μs ± 77.9 ns | 46.8 μs ± 95.1 ns | 1.27 μs ± 9.67 ns | 24 ms ± 107 μs |
| 1024x1024 | 246 μs ± 1.82 μs | 263 μs ± 6.77 μs | 1.28 μs ± 5.84 ns | 97.8 ms ± 215 μs |
| 2048x2048 | 1.15 ms ± 13.2 μs | 1.16 ms ± 15 μs | 1.28 μs ± 6.15 ns | 395 ms ± 1.18 ms |

(all timings are %timeit results: mean ± std. dev. of 7 runs)

For image sizes of 4096x4096, pyarrow also seems to fail with ValueError: Image is in multiple array blocks, use imaging_new_block for zero copy.

So in summary, creating pyarrow arrays is much faster, but going from pyarrow back to numpy is very slow.

@radarhere (Member)

pa.array(img).to_numpy(zero_copy_only=True) doesn't seem to work in my case

Just in case it is something interesting that we should consider in the future, could you explain this slightly more?

@lgeiger (Author) commented Oct 1, 2025

pa.array(img).to_numpy(zero_copy_only=True) doesn't seem to work in my case

Just in case it is something interesting that we should consider in the future, could you explain this slightly more?

The following code would raise ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

img = Image.fromarray(np.random.randint(0, 255, size=(128, 128, 3), dtype=np.uint8))
arr = pa.array(img)
np_arr = arr.to_numpy(zero_copy_only=True)  # zero_copy_only=True is also the default

I haven't used pyarrow before, but judging from the docs it raises because "the conversion to a numpy array would require copying the underlying data (e.g. in presence of nulls, or for non-primitive types)" while zero_copy_only was True.

@lgeiger (Author) commented Oct 3, 2025

Just to mention for anyone else reading this - individual users can already adjust this behaviour for themselves by setting MAXBLOCK.

The main benefit of this PR is that, since the buffer size depends on the image size, users still get low memory usage for small images while benefiting from a larger buffer for large images.
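
Roughly, the sizing idea amounts to something like the following (a simplified sketch, not the PR's actual code; the function and parameter names are illustrative):

def raw_encoder_bufsize(width: int, height: int, bytes_per_pixel: int) -> int:
    # Size the buffer from the image itself so that a single encode() call
    # can usually produce the whole image in one chunk.
    return width * height * bytes_per_pixel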

Let me know if you have any concerns that would prevent merging of the PR.

@radarhere requested a review from wiredfool, October 6, 2025 01:20
@lgeiger (Author) commented Oct 9, 2025

@radarhere Any updates on when/whether this PR could be merged? Or are there any additional benchmarks that you would like me to run?

@aclark4life (Member)

@lgeiger We'll probably need @wiredfool to look closer too.

@wiredfool (Member)

The original point of this particular bit of code was to have predictable memory usage when running tobytes: in this case, either ImageFile.MAXBLOCK or, at the very least, the size of one row (as the shuffler is row-based, and if the buffer is smaller than one row, no progress can be made).

So, where previously we needed 2xImageMemory + 64k, now we need 3xImageMemory.

For smaller images, it's not a problem, but for larger images this may cause memory pressure where we didn't have it before. I'd consider that a regression.

One alternative here is to change the calculation so that it's the min of max(MAXBLOCK, row_size) and the image size, at which point MAXBLOCK can be boosted without allocating excessive memory in the small-image case.
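
Expressed as code, that sizing rule might look roughly like this (a sketch with illustrative names, not actual Pillow code):

from PIL import ImageFile

def raw_bufsize(row_size: int, image_size: int) -> int:
    # One row is the minimum a row-based shuffler needs to make progress;
    # capping at the full image size avoids over-allocating for small images.
    return min(max(ImageFile.MAXBLOCK, row_size), image_size)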

There may be other places where MAXBLOCK shouldn't be large, and they'd need similar checks.

@lgeiger (Author) commented Oct 10, 2025

@wiredfool Thanks for taking a look. I agree increasing MAXBLOCK globally would be a problem since it increases memory usage, but for tobytes() I'm not sure this is actually the case.

The way I understand the code is the following: all chunks are appended to a list, which is joined afterwards; that causes the 2x ImageMemory usage on the Python side that you mentioned above.

Pillow/src/PIL/Image.py

Lines 798 to 808 in 6d6f049

output = []
while True:
    bytes_consumed, errcode, data = e.encode(bufsize)
    output.append(data)
    if errcode:
        break
if errcode < 0:
    msg = f"encoder error {errcode} in tobytes"
    raise RuntimeError(msg)
return b"".join(output)

I don't think the max memory usage would include the additional 64k buffer, but I haven't looked at what the C code is actually doing, so I might be wrong.

In any case, with this PR the output list only consists of a single item (assuming the buffer size estimate was large enough), which avoids allocating a new bytes object during the join. So the memory usage of the Python code would actually be halved. I thought about changing the code to directly return data, but I wasn't sure whether I'm missing any edge cases where the actual size would be larger than the buffer size estimate added here.
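
For reference, that "return data directly" idea could look roughly like the following variant of the loop quoted above (a sketch, not part of this PR; it relies on a positive errcode meaning the encoder finished, as in the existing loop):

bytes_consumed, errcode, data = e.encode(bufsize)
if errcode > 0:
    # the whole image fit into the first chunk: return it without
    # building a list and copying via b"".join()
    return data
# otherwise fall back to the existing chunked loop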

I double-checked this with a memory profile and viewed the memory usage with memray summary <filename>:

import memray
import numpy as np
from PIL import Image

rng = np.random.default_rng(42)

def get_image(size):
    return Image.fromarray(rng.integers(0, 256, size=(size, size, 3), dtype=np.uint8))

for size in (512, 1024, 2048, 4096, 8192, 16384):
    img = get_image(size)
    with memray.Tracker(f"pr_{size}.bin"):
        img.tobytes()

And the results show that this PR halves the memory usage, which matches my theory from above:

| image size | memory (main) | memory (this PR) | allocations (main) | allocations (this PR) |
| --- | --- | --- | --- | --- |
| 512x512 | 1.576MB | 788.001kB | 16 | 17 |
| 1024x1024 | 6.300MB | 3.149MB | 52 | 2 |
| 2048x2048 | 25.197MB | 12.589MB | 209 | 2 |
| 4096x4096 | 100.775MB | 50.344MB | 824 | 2 |
| 8192x8192 | 403.174MB | 201.351MB | 4100 | 2 |
| 16384x16384 | 1.613GB | 805.356MB | 16388 | 2 |
